Explainer notebook - The Foods-Network¶

Contribution statement:¶

Team members:

  • Jacob (s214596)
  • Kristoffer (s214609)
  • Karoline (s214638)

All members collaborated and contributed to every part of the assignment.

Relevant Links:¶

A project website has been deployed to present the findings caried out in the study.

Specifics can be found in the following Github repository.

Below is all relevant packages used to complete this project:

In [28]:
import pandas as pd
import matplotlib.pyplot as plt
from itertools import product
import seaborn as sns
import numpy as np
from wordcloud import WordCloud
import ast
import requests
import altair as alt
from tqdm import tqdm
import networkx as nx
import json
from networkx.readwrite import json_graph
import vl_convert as vlc
import random
import matplotlib.colors as mcolors
from netwulf import visualize, draw_netwulf
import community
from wordcloud import WordCloud
from joblib import Parallel, delayed
import os # This is imported in order to store API key as environment variable
from openai import OpenAI
import random
from nltk.tokenize import word_tokenize, sent_tokenize
from collections import Counter
from itertools import chain
from selenium import webdriver
from bs4 import BeautifulSoup
import time
import pandas as pd

Table of Contents¶

  • 1. Motivation

    • 1.1 Motivation and selection of data for the study
    • 1.2 The goal for the end user’s experience
  • 2. The data collection and basic stats

    • 2.1 Data collection and webscraping
    • 2.2 Preprocessing and Cleaning
      • 2.2.1 Overview and Cleaning the Product data
      • 2.2.2 Overview and cleaning of Textual data
      • 2.2.3 Basic stats of the final merged and cleaned data
    • 2.3 Creating the network
      • 2.3.1 Basic stats of the Network
  • 3. Tools, theory and analysis

    • 3.1 Network analysis
      • 3.1.1 Degree distribution
      • 3.1.2 Assortativity
      • 3.1.3 Communities
      • 3.1.4 Centrality
    • 3.2 Text analysis
      • 3.2.1 Tokenization and distribution of text length
      • 3.2.2 Frequency-Rank-Plot
      • 3.2.3 TF-IDF scores and Wordclouds
      • 3.2.4 Labeling of communities using OpenAI's API
  • 4. Discussion

  • 5. References

1. Motivation ¶

1.1 Motivation and selection of data for the study ¶

This project aims to uncover valuable insights into consumer behavior in Denmark by investigating a network of grocery items frequently purchased together in some of the country’s largest and most well-known grocery stores owned by the Salling Group. Understanding consumer behavior is essential, as it includes a broad and important field that integrates psychology, sociology, economics, and marketing [1]. In this project we will investigate this network using a social science approach in order to discover how consumer choices are reflected in what products gets bought together.

The data used in the project includes an id, name, price and a short description of common everyday food-products that is collected using scraping methods on the BilkaToGo website. The project utilizes the Salling Group API to gather information about items that are frequently bought together by calling for each product id. This particular data was chosen in order to create a network of food-items where an edge between two products occur if they are on the same 'frequently bough together' list. The project will conduct a thoughrough analysis of the network in order to get insights of the networks structure and investigate which types of food-items often land in the same basket and what kind og communities emerge based on what the average consumer decides to buy together when shopping.

The descriptions collected for each product are a small text-pieces of in average around 60 words typically explaining what the product tastes like, and what occasions/setting they are typically enjoyed in. This data is used to conduct a textual analysis using NLP tools to get insights into the communities and label them according to the information they hold.

1.2 The goal for the end user’s experience ¶

The Boston Consulting Group mentiones in a Danish consumer sentiment series that in 2023 "health was the most important purchasing criterion, with 42% of respondents considering health important when buying food." [2]. Additionally they mention that the sustainability trend is prominent in the groceries category, with 21% of consumers finding sustainability important when shopping for groceries. Trough the analysis this project aims to investigate if some of these patterns are recognisable in the network and shed some light on what types of products generally tend to get puchased together. Subsequently a goal is that the analysis will be reflecting consumer behaviors that are recognized in our daily life but also discover patterns we might hadn't initially though of.

2. The data collection and basic stats

2.1 Data collection and Webscraping ¶

To analyze consumer patterns, we need data of different items sold by the Salling Group. We utilize the Salling Group API specifically the frequently bought together endpoint. This requires product IDs, which is web scraped from the from the BilkaToGO where they originate. Because we need 3 types of data from the products, we have divided our webscraping into 3 different parts:

  1. Webscraping for productID's
  2. Webscraping for individual product descriptions
  3. Frequently bought together endpoint call

The BilkaToGO website is structured in a way that all categories fall into one of 21 categories that each has a dedicated page. Initially, we looped through each category link to scrape all products on these pages. However, the site's heavy reliance on JavaScript, tied to a backend database, made web scraping challenging with a more tideous and longsome process. To overcome this, we used the Python package 'Selenium' with a webdriver to interact with the pages. This process required opening each page and using CSS selectors to scrape the data.

This process took a long time to figure out and to run in order to get all productID's for the website. We uncovered ~32.000 products and for each page there are ~60 products. This means that we initialized the webdriver for each product page around 533 times for the first part. A thing to note about the dataset, is that the website is tied to a backend database and grocery prices and product change a lot over time, so the dataset of the products are specific for the date and time we scraped it. (The 17th of April 2024)

When running through this first process of datacollecting, we also gathered data about the name of the product, price, link and category.

The second part of the data collection was to gather the textual data for each product in form of a product description tied to each individual product.

1. Webscraping for productID's¶

In [34]:
BASE_LINK = "https://www.bilkatogo.dk"

def extract_products(url, category_list):

    products_df = pd.DataFrame(columns=['product_id', 'name', 'price', 'link', 'category'])
    
    for category in tqdm(category_list):
        page_counter = 0
        
        SCRAPE_LINK = f"{url}/{category}"   # Construct the URL for the first case with no page number
        product_card_containers = ["dummy"]
        
        while len(product_card_containers) != 0:
            print("The length of product_card_containers is: ", len(product_card_containers))
            print("The page counter is: ", page_counter)
            
            # Initialize Edge WebDriver
            options = webdriver.EdgeOptions()
            options.use_chromium = True
            options.add_argument('headless')  # To run Edge in headless mode
            driver = webdriver.Edge(options=options)

            # Load the webpage
            driver.get(SCRAPE_LINK)

            # Wait for a few seconds to ensure JavaScript execution
            time.sleep(0.5)

            # Get the page source after JavaScript execution
            page_source = driver.page_source

            # Close the WebDriver
            driver.quit()

            # Parse the HTML content
            soup = BeautifulSoup(page_source, 'html.parser')

            # Find the div element with specific attributes
            div_element = soup.find_all('div', {'data-v-da5161c2': True, 'data-v-e0535ac4': True})
            # Find all elements with product-card-container class
            product_card_containers = soup.find_all('div', {'class' : 'product-card-container'})
            

            #TODO:
            for product in product_card_containers:
                if product == 'dummy':
                    continue
                product_info = product.find('div')
                product_id = product_info.attrs['data-productid']
                product_name = product_info.contents[0]['aria-label']
                product_link = product_info.contents[0]['href']


                prod_price = product_info.find('div', {'class' : 'row product-description flex-column'})
                produc_price = prod_price.find('p', {'class' : 'description'})

                ### Added a logic for 'drikkevarer' as the category doesn't have the same price format as the others and follows a slight random pattern when displaying the price ###
                if category == "drikkevarer/":
                    for span in produc_price.find_all('span'):
                        if "/L." in span.text:
                            product_kg_price = span.string
                else:
                    product_kg_price = produc_price.find_all('span')[-1].string     # Extract the kg./stk. price from the product. Sufficient for all categories except 'drikkevarer'
            
                #price = [int(s) for s in test.split() if s.isdigit()]
                new_row = {'product_id' : product_id, 'name' : product_name, 'link' : product_link, 'price' : product_kg_price, 'category' : category}
                products_df.loc[len(products_df)] = new_row
            
            page_counter += 1
            SCRAPE_LINK = f"{url}/{category}/?page={page_counter}"
        
    return products_df
    


LINK = "https://www.bilkatogo.dk/kategori/"
categories = ["frugt-og-groent/", "koed-og-fisk/", "mejeri-og-koel/", "drikkevarer/", "broed-og-kager/", "kolonial/", "slik-og-snacks/", "frost/", "kiosk/", "dyremad/", "husholdning/",
                "personlig-pleje/", "baby-og-boern/", "bolig-og-koekken/", "fritid-og-sport/", "toej-og-sko/", "elektronik/", "have/", "leg/", "biludstyr/", "byggemarked/"]


prods = extract_products(LINK, ["drikkevarer/"])
prods.to_csv('data/df_Salling_Products.csv', sep=';', index=False, header=False)

2. Webscraping for individual product descriptions¶

In [ ]:
BASE_LINK = "https://www.bilkatogo.dk"
df1 = pd.DataFrame(columns=["p_id", "descriptions"])



def get_product_description(link):
    product_descriptions = []
    #for link in product_links:
    url = BASE_LINK + link  
    
    options = webdriver.EdgeOptions()
    options.use_chromium = True
    options.add_argument('headless')  # To run Edge in headless mode

    driver = webdriver.Edge(options=options)

    # Load the webpage
    driver.get(url)
    
    # Wait for a few seconds to ensure JavaScript execution
    time.sleep(2)

    # Get the page source after JavaScript execution
    page_source = driver.page_source

    # Close the WebDriver
    driver.quit()

    # Parse the HTML content
    soup = BeautifulSoup(page_source, 'html.parser')
    try:
        description = soup.find('section', {'id' : 'content-description', 'class': 'content'})
        a = description.find("h2")
        if a != None:
            description = ''.join([str(tag) for tag in reversed(list(description.h2.previous_siblings)) if tag.name != 'h2'])
            description = description.replace("<br/><br/>","")
        else:
            description = description.text
    except AttributeError:
        description = "No description available"

    product_descriptions.append(description)

    return product_descriptions





products = pd.read_csv('data/df_Salling_Products.csv', sep=";")
product_links = products["link"]
product_id = products["product_id"]


for i in tqdm(range(len(pd.read_csv('data\df_Salling_Products_Descriptions_CLEANED.csv', sep=";")), len(product_links), 14)):

    results = Parallel(n_jobs=14)(delayed(get_product_description)(link) for link in product_links[i:i+14])

    df1['p_id'] = list(product_id[i:i+14])
    df1["descriptions"] = sum(results, [])
    df1.to_csv('data/df_Salling_Products_Descriptions_CLEANED.csv', sep=';', mode='a', index=False, header=False)

3. Frequently bought together endpoint call¶

In [ ]:
df2 = pd.read_csv("data/df_Salling_Products_outer_categories.csv", sep=";")
df2 = list(df2.loc[df2["outer_category"] == "Foods"]["product_id"])

bearer_token = "6656b12a-84c5-48df-a174-4b3c1cd38614"

class BearerAuth(requests.auth.AuthBase):
    def __init__(self, token):
        self.token = token
    def __call__(self, r):
        r.headers["authorization"] = "Bearer " + self.token
        return r

df_neighbor = pd.DataFrame(columns=["product_id", "neighbor_products_id"])

def get_neighbours(product_id):
    neighbour_id = []
    url = "https://api.sallinggroup.com"
    version = "v1-beta"
    recourses = "product-suggestions/frequently-bought-together"
    PARAMETERS = {
        "productId" : f"{product_id}"
    }

    API_LINK = f"{url}/{version}/{recourses}"

    results = requests.get(API_LINK, params = PARAMETERS, auth=BearerAuth(bearer_token)).json()
    
    for neighbours in results:
        try:
            neighbour_id.append(int(neighbours['prod_id']))
        except:
            print("Error")
            if results['statusCode'] == 429:
                print("Sleeping the code... Error 429")
                time.sleep(20)
        # # except TypeError:
        # #     print("Sleeping the code... Error 429")
        # #     print(results['error'])
        # #     time.sleep(10)
        # except KeyError:
        #     neighbour_id.append('No product available')

    return neighbour_id
    
#total_entities = response.headers["X-Total-Count"]

time_step = 4

for i in tqdm(range(len(pd.read_csv("data/df_Salling_Products_Neighbours.csv", sep=";")), len(df2), time_step)):
    neighbor_list = []
    p_id_list = []
    time.sleep(time_step * 0.2)
    
    p_id_list.append(list(df2[i:i+time_step]))

    results = Parallel(n_jobs=time_step)(delayed(get_neighbours)(product_id) for product_id in df2[i:i+time_step])
    
    for neighbours in results:
        neighbor_list.append(neighbours)
        
    df_neighbor["product_id"] = sum(p_id_list, [])
    df_neighbor["neighbor_products_id"] = neighbor_list
    df_neighbor.to_csv('data/df_Salling_Products_Neighbours.csv', sep=';', mode='a', index=False, header=False)

2.2 Preprocessing and Cleaning ¶

In this section the data is filtered and cleaned to prepare it to be used in the analysis. The preprocessing is handeled in 3 steps:

  1. Product data (ids, names, prices, categories)
  2. Textual data (product descriptions)
  3. Neighbors data (frequently bought together)

The pandas library is used to read the csv files for the webscraped data:

In [29]:
df_products = pd.read_csv('data/df_Salling_Products.csv' , delimiter=";")   # Read the csv file containing the products_id
df_text = pd.read_csv('data/df_Salling_Products_Descriptions_CLEANED.csv' , delimiter=";")  # Read the csv file containing the descriptions
df_neighbours = pd.read_csv('data/df_Salling_Products_Neighbours.csv', delimiter=";")   # Read the csv file containing the neighbours to a specific product_id

2.2.1 Overview and Cleaning the product_id data ¶

This section handles the product data that was scraped from from the BilkaToGO website.

In [30]:
# Cleaning product id data by stripping unnecessary characters ect.
df_products[['price_amount', 'unit']] = df_products['price'].str.split('/', expand=True)    # Cleaning price attribute
df_products['price_amount'] = df_products['price_amount'].str.replace(',', '.').str.extract('(\d+.\d+)').astype(float)  # Cleaning price attribute

df_products['category'] = df_products['category'].str.replace('/','')     # Cleaning category attribute 
df_products.drop(columns=['Unnamed: 0'], inplace=True)    # Drop the original price column

#Distribute the 21 different inner categories into 3 main categories: Foods, House and Other
foods = ['frugt-og-groent', 'koed-og-fisk', 'mejeri-og-koel', 'drikkevarer', 'broed-og-kager','kolonial', 'slik-og-snacks', 'frost']
house = ['husholdning', 'personlig-pleje', 'baby-og-boern', 'bolig-og-koekken', 'fritid-og-sport', 'toej-og-sko', 'have', 'leg', 'byggemarked']
other = ['dyremad','elektronik','biludstyr', 'kiosk']

df_products['outer_category'] = df_products['category'].map(lambda x: 'Foods' if x in foods else ('House' if x in house else 'Other'))

df_products.head()
Out[30]:
product_id name price link category price_amount unit outer_category
0 18381 Bananer 2,75/Stk. /produkt/bananer/18381/ frugt-og-groent 2.75 Stk. Foods
1 51061 Peberfrugter røde 7,25/Stk. /produkt/peberfrugter-roede/51061/ frugt-og-groent 7.25 Stk. Foods
2 61090 Agurk øko 9,00/Stk. /produkt/salling-oeko-agurk-oeko/61090/ frugt-og-groent 9.00 Stk. Foods
3 72008 Bananer 4 pak øko 2,20/Stk. /produkt/bananer-4-pak-oeko/72008/ frugt-og-groent 2.20 Stk. Foods
4 18323 Gulerødder 10,00/Kg. /produkt/salling-guleroedder/18323/ frugt-og-groent 10.00 Kg. Foods

Distribution of the categories of the webscraped products.

In [31]:
alt.data_transformers.disable_max_rows()

domain=['Foods', 'House', 'Other']
crange=['#2ca02c','#1f77b4', '#ff7f0e', ]

all_chart = alt.Chart(df_products).mark_bar().encode(
    x='count():Q',
    y=alt.Y('category:N', sort='-x', title='Inner Category'),
    color=alt.Color('outer_category:N',scale=alt.Scale(domain=domain, range=crange),title='Outer Category'),
    tooltip=['category', 'outer_category', 'count()']
).properties(
    title='Distribution of the products in the different categories',
    width=300,
    height=300
)#.configure(background='transparent')

df_filtered = df_products[df_products['outer_category'] == 'Foods']

foods_chart = alt.Chart(df_filtered).mark_bar().encode(
    x='count():Q',
    y=alt.Y('category:N', sort='-x', title='Inner Categories'),
    color=alt.Color('outer_category:N',scale=alt.Scale(domain=domain, range=crange),title='Outer Category'),
    tooltip=['category', 'outer_category', 'count()']
).properties(
    title='Distribution of the products in the Foods category only',
    width=300,
    height=300
)#.configure(background='transparent')


concated = (all_chart | foods_chart).interactive()#.configure(background='transparent')
concated
#concated.save('images/concated_chart_categories.png', scale_factor=2.0)
Out[31]:
In [32]:
df_products['outer_category'].value_counts()
Out[32]:
outer_category
House    19163
Foods    10269
Other     2853
Name: count, dtype: int64

The plot shows the distribution of products across the different categories. The categories are divided into 3 main categories: 'Food', 'House' and 'Other'. The Food category contains 10.269 products, the House category contains 19.163 products and the Other category contains 2.853 products.

In this project we will be focusing on the Food category.

2.2.2 Overview and cleaning of Textual data

On the bilkaToGo website it was discovered that some of the text descriptions included descriptions of the overall brand which was decided not to have any interesting information to use in the analysis. Therefore it was decided to remove this extra description that is not about the product itself. This was done in the scraping fase and the final df that was scraped is:

In [33]:
df_text['descriptions'] = df_text['descriptions'].str.lower().str.split('om salling').str[0].str.strip()

df_text['description_length'] = df_text['descriptions'].str.split().str.len()

#rename the column p_id to product_id to match the other dataframes
df_text = df_text.rename(columns = {'p_id':'product_id'})

#remove nan values
print(df_text["descriptions"].isna().value_counts())
df_text.dropna(inplace=True)

df_text.head()
descriptions
False    9906
True        5
Name: count, dtype: int64
Out[33]:
product_id descriptions description_length
0 18381 bananer har en anelse syrlig, mild og sød smag... 89.0
1 51061 peberfrugter har en sød og syrlig smag med hin... 88.0
2 61090 agurker smager mildt, sødt, en anelse syrligt ... 95.0
3 72008 bananer har en anelse syrlig, mild og sød smag... 89.0
4 18323 gulerødder har en sød, frugtig, mild og en ane... 70.0
Merging the datasets¶

We create one df with all information from only the foods category by merging the datasets. The neighbours dataframe only contains products from the 'Foods' category so we see a change in number of rows when merging

In [34]:
#converting the lists to lists and not strings
df_neighbours['neighbor_products_id'] = df_neighbours['neighbor_products_id'].apply(ast.literal_eval)

df_clean1 = pd.merge(df_products, df_neighbours, on='product_id', how='inner')
df_clean1 = pd.merge(df_clean1, df_text, on='product_id', how='inner')


df_clean1 = df_clean1[df_clean1['neighbor_products_id'].apply(lambda x: len(x) > 0)]
df_clean1 = df_clean1[df_clean1['descriptions'].apply(lambda x: x != "No description available")]

print(f"The overall amount of products only considering the foods category has gone from shape {df_products.shape} to shape {df_clean1.shape} after cleaning and merging")
#df_clean.head()
The overall amount of products only considering the foods category has gone from shape (32285, 8) to shape (9906, 11) after cleaning and merging
In [35]:
df_clean1['description_length'].describe()
Out[35]:
count    9906.000000
mean       57.560367
std        17.552294
min         1.000000
25%        47.000000
50%        55.000000
75%        65.000000
max       325.000000
Name: description_length, dtype: float64

These description vary a lot in length, from the shortest description length of just 1 word about this wine bottle with just '.' and the longest description length of 325 words about this bottle of rum. The average word length of the descriptions is around 58 with a standard deviation of 18 and a median of 55. And if we take a look at the distribution of the word length of the product descriptions, we can see that it is normally distributed.

We introduce a threshold to exclude products with uninformative descriptions from our textual analysis. By analyzing the distribution of word lengths in the descriptions, we determine an appropriate cut-off point.

In [12]:
#plot the length of the descriptions for the products
plt.figure(figsize=(6,3))
sns.kdeplot(df_clean1['description_length'])
plt.axvline(30, c='r')
plt.title('Length of descriptions')
plt.show()
/opt/anaconda3/envs/comsocsci2024/lib/python3.10/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):
No description has been provided for this image

The threshold is decided as a cut-off of the low end tail of the normal distribution at any description ≤ 30 words.

In [36]:
word_length_threshold = 30

df_clean = df_clean1.loc[df_clean1['description_length'] >= word_length_threshold]    # Get rid of the descriptions that are less than 30 words
print(f"The amount of products removed by the threshold is {df_clean1.shape[0] - df_clean.shape[0]}")
The amount of products removed by the threshold is 149

By excluding products with descriptions of fewer than 30 words, we removed 149 products from the dataset. This cut-off resulted in a new average description length of 58.12 words, with a standard deviation of 17.05 and a median of 55 words.

2.2.3 Overview of the final and cleaned data

After making these filtering and cleaning decisions we end up with a dataset saved in the df_clean dataframe. An overview of the basic stats of the overall final and cleaned data is seen below:

In [37]:
print(f"In total we have {df_clean.shape[0]} products in the final dataset")
print(f"The amount of categories is {df_clean['category'].nunique()}")
df_clean.describe()
In total we have 9757 products in the final dataset
The amount of categories is 8
Out[37]:
product_id price_amount description_length
count 9757.000000 9757.000000 9757.000000
mean 76928.596392 140.941703 58.116532
std 34677.785903 140.459136 17.051650
min 14532.000000 0.020000 30.000000
25% 51050.000000 50.000000 47.000000
50% 81648.000000 102.670000 55.000000
75% 108349.000000 178.570000 66.000000
max 133516.000000 998.570000 325.000000
In [ ]:
print(df_clean.loc[df_clean['price_amount'].idxmin()])
print("-----------------------------------------------------------")
print(df_clean.loc[df_clean['price_amount'].idxmax()])
product_id                                                         116999
name                                                        Sødetabletter
price                                                           0,02/Stk.
link                              /produkt/salling-soedetabletter/116999/
category                                                         kolonial
price_amount                                                         0.02
unit                                                                 Stk.
outer_category                                                      Foods
neighbor_products_id    [25921, 88170, 110873, 20843, 105738, 108788, ...
descriptions            sødetabletter fra salling er perfekte til at s...
description_length                                                   40.0
Name: 5307, dtype: object
-----------------------------------------------------------
product_id                                                          39579
name                                      Single Malt Scotch Whisky 16 år
price                                                           998,57/L.
link                    /produkt/lagavulin-single-malt-scotch-whisky-1...
category                                                      drikkevarer
price_amount                                                       998.57
unit                                                                   L.
outer_category                                                      Foods
neighbor_products_id    [69280, 39574, 52192, 79286, 44457, 74077, 192...
descriptions            skotland - 43% - 70cllagavulin 16 års islay si...
description_length                                                  102.0
Name: 3467, dtype: object

Across 8 food categories, the average product price is approximately 140 kr., which initially seems quite high. It's important to note that this average is significantly influenced by the kilo prices, which pulls up the overall price average considerably. The most expensive item in the dataset is a Single Malt Scotch Whisky 16 år, with a price of 998.57kr/L and the cheapest food item is Sweetening tablets with the price of 0.02kr/stk. Not surprisingly, the cheapest items are those priced per piece rather than per kilo.

The data has been preprocessed and can now be extracted as a CSV file for the further analysis.

In [ ]:
df_clean.head()
##save the cleaned data to a csv file
#df_clean.to_csv('data/df_clean_data.csv', index=False, sep=";")
Out[ ]:
product_id name price link category price_amount unit outer_category neighbor_products_id descriptions description_length
0 18381 Bananer 2,75/Stk. /produkt/bananer/18381/ frugt-og-groent 2.75 Stk. Foods [18379, 18364, 50998, 51061, 53365, 84121, 197... bananer har en anelse syrlig, mild og sød smag... 89.0
1 51061 Peberfrugter røde 7,25/Stk. /produkt/peberfrugter-roede/51061/ frugt-og-groent 7.25 Stk. Foods [18364, 53365, 116664, 18381, 29439, 61090, 18... peberfrugter har en sød og syrlig smag med hin... 88.0
2 61090 Agurk øko 9,00/Stk. /produkt/salling-oeko-agurk-oeko/61090/ frugt-og-groent 9.00 Stk. Foods [72008, 19687, 37982, 39411, 51061, 41388, 116... agurker smager mildt, sødt, en anelse syrligt ... 95.0
3 72008 Bananer 4 pak øko 2,20/Stk. /produkt/bananer-4-pak-oeko/72008/ frugt-og-groent 2.20 Stk. Foods [61090, 19687, 53365, 39411, 18364, 29439, 404... bananer har en anelse syrlig, mild og sød smag... 89.0
4 18323 Gulerødder 10,00/Kg. /produkt/salling-guleroedder/18323/ frugt-og-groent 10.00 Kg. Foods [18364, 51061, 29439, 18381, 53365, 85465, 841... gulerødder har en sød, frugtig, mild og en ane... 70.0

2.3 Creating the network ¶

Working from the cleaned data from the previous section we use the neighbor_products_id to create edges between food-items.

In [38]:
network_df = pd.read_csv('data/df_clean_data.csv',delimiter=";")

#make names lowercase
network_df['name'] = network_df['name'].str.lower()
#create a new attribute for the ecology products
network_df['ecology'] = network_df['name'].str.contains('øko')

#unpack the list of neighbors using ast.literal_eval
network_df['neighbor_products_id'] = network_df['neighbor_products_id'].apply(ast.literal_eval)

#network_df.head()

Since it was decided only to focus on the network of food items that are in the foods category, we remove all neighboring ids that does not belong to the outer category of foods.

In [39]:
#remove all the neighbor_products_ids that are not in Foods category 
network_df["neighbor_products_id"] = network_df["neighbor_products_id"].apply(lambda x: [p_id for p_id in x if p_id in network_df["product_id"].values])

#sort the neighbor_products_ids
network_df["neighbor_products_id"].apply(lambda x: sorted(x))

#make a list of all the pairs of the items
def find_pairs(my_list,id):
    pairs = []
    for neighbour in my_list:
        pairs.append((id,neighbour))
    return pairs

network_df["neighbor_products_id"] = network_df["neighbor_products_id"].apply(lambda x: sorted(x))

big_list = []

for (p_id,neigbours,idx)  in zip(network_df['product_id'],network_df['neighbor_products_id'],network_df.index):
    big_list += [find_pairs(neigbours, p_id)]
network_df['Pairs'] = big_list

#network_df['Pairs']
In [ ]:
all_pairs = network_df['Pairs'].explode()

all_pairs.groupby(all_pairs).count().sort_values()

# Group by the pairs of item_ids, count occurrences, and sort by values
pair_counts = all_pairs.groupby(all_pairs).count().sort_values()

# Extract pairs (as index) and counts (as values)
pairs = pair_counts.index
counts = pair_counts.values

# Store pairs and counts as tuples in a list
Weighted_edge_list = [(item[0], item[1], count) for item, count in zip(pairs, counts)]

# Create an empty undirected graph
G = nx.Graph()

# Add weighted edges from the result list to the graph
G.add_weighted_edges_from(Weighted_edge_list)


#add the attribute information to the nodes
for node in G.nodes:
    G.nodes[node]["name"] = str(network_df[network_df["product_id"] == node]["name"].values[0])
    G.nodes[node]["category"] = str(network_df[network_df["product_id"] == node]["category"].values[0])
    G.nodes[node]["price_amount"] = network_df[network_df["product_id"] == node]["price_amount"].values[0]
    G.nodes[node]["unit"] = str(network_df[network_df["product_id"] == node]["unit"].values[0])
    G.nodes[node]["ecology"] = str(network_df[network_df["product_id"] == node]["ecology"].values[0])

G.nodes.data()
Save the Graph as a Json file¶
In [ ]:
graph_dict = nx.node_link_data(G)

#json_graph.node_link_data(G)
# Convert int64 types to native Python types
def convert(o):
    if isinstance(o, np.int64):
        return int(o)
    raise TypeError

# Write the graph dictionary to a JSON file
with open("data/network_with_attributes.json", "w") as f:
    json.dump(graph_dict, f, default=convert,indent=4)

2.3.1 Basic Stats of the Network ¶

In [41]:
print(f"The number of nodes in the graph is {G.number_of_nodes()}")
print(f"The number of edges in the graph is {G.number_of_edges()}")
print(f"The number of connected components in the graph is {nx.number_connected_components(G)}")
print(f"The density of the graph is {nx.density(G)}")
print(f"The graph is connected: {nx.is_connected(G)}")
The number of nodes in the graph is 9755
The number of edges in the graph is 74180
The number of connected components in the graph is 1
The density of the graph is 0.001559217856134302
The graph is connected: True

After creating the network, it is evident that there is a single large connected component, meaning the dataset forms one interconnected network. This network consists of 9,755 food items connected by 74,180 edges. As a result, all nodes are interconnected, with no isolated nodes emerging. This connectivity is due to our method of defining links between food items and the fact that the frequently bought together endpoint always returns a list of 10 items. Consequently, every item in the graph starts with at least 10 neighbors until we remove non-food items. This initial structure significantly impacts the degree distribution, which we will explore further in the next section.

3. Tools, theory and analysis ¶

3.1 Network analysis ¶

In the network analysis we will delve into details giving us insights into consumer behavior in terms of how food item are purchased together. As an overview we are going to investigate the following:

  1. The network properties as a whole to understand the overall structure that the food-items network follows by looking at the degree distribution. We will compare the real network with a random network to display if it follows non-random patterns that are indicative of underlying rules or behaviors guiding the connections.
  2. Investigate Assortativity to see if there is a tendency that some nodes in the network connect to other nodes that are similar in degree and other attributes.
  3. The network is partitioned into communities to can gain insights into how consumers group certain types of food together when shopping.
  4. The most Central nodes in the network is investigated in order to find the product that acts as hubs between communities.

3.1.1 Degree Distribution ¶

Random Network as a Baseline¶

We calculate p and k for the food-network using equation 3.2 from the network science text-book [3]

In [ ]:
N = G.number_of_nodes() 
L = G.number_of_edges() 

p = 2*L/(N*(N-1)) 

#calculate the average degree using p
k = p*(N-1)
print(f'The number of nodes in the Food-network is: {N}')
print(f'The number of links in the Food-network is: {L}')
print(f'The probability of a link between two nodes is: {p}')
print(f'Average degree of the Food-network: {k}')
The number of nodes in the Food-network is: 9755
The number of links in the Food-network is: 74180
The probability of a link between two nodes is: 0.001559217856134302
Average degree of the Food-network: 15.208610968733982

We define a function that models a random graph based on the Erdős-Rényi model, characterized by having N nodes where each pair of distinct nodes is connected with probability p:

In [ ]:
#function to calculate generate random network:
def generate_random_network(node_count, probability):
    random_network = nx.Graph()
    nodes = range(node_count)
    random_network.add_nodes_from(nodes)

    for i in nodes:
        for j in nodes:
            if i < j:
                if np.random.uniform(0, 1) < probability:
                    random_network.add_edge(i, j)

    return random_network

# Generate random network
random_network = generate_random_network(N, p)
In [ ]:
Normal_degrees = [degree for node, degree in random_network.degree()]
Food_degrees = [degree for node, degree in G.degree()]   
In [ ]:
#Visualize the degree distribution of the random network and the Food-network
fig, ax = plt.subplots(figsize=(10, 6))

# Define bins using logspace for logarithmic scaling
bins_norm = np.logspace(0, np.log10(max(Normal_degrees)), 75)
bins_food = np.logspace(np.log10(min(Food_degrees)), np.log10(max(Food_degrees)), 75)

# define the degree distribution of the random network
hist_normal, edges_normal = np.histogram(Normal_degrees, bins=bins_norm, density=True)
x_norm = (edges_normal[1:] + edges_normal[:-1]) / 2

# Filter empty bins
xx_norm, yy_norm = zip(*[(i, j) for (i, j) in zip(x_norm, hist_normal) if j > 0])

# define the degree distribution of the Computational Social Scientists network
hist_food, edges_food = np.histogram(Food_degrees, bins=bins_food, density=True)
x_food = (edges_food[1:] + edges_food[:-1]) / 2

# Filter empty bins
xx_food, yy_food = zip(*[(i, j) for (i, j) in zip(x_food, hist_food) if j > 0])

# Plot them
ax.plot(xx_norm, yy_norm, marker='.', label='Random Network', color='r')
ax.plot(xx_food, yy_food, marker='.', label='Foods Network', color='green')

# Calculate average degree for both networks
avg_degree_random = np.mean(Normal_degrees)
avg_degree_food = np.mean(Food_degrees)
med_degree_food = np.median(Food_degrees)

# Add vertical lines for average degrees
ax.axvline(avg_degree_random, color='r', linestyle='--', label=f'Average Degree (Random Network): {avg_degree_random:.2f}')
ax.axvline(avg_degree_food, color='orange', linestyle='--', label=f'Average Degree (Food Network): {avg_degree_food:.2f}')
ax.axvline(med_degree_food, color='green', linestyle='--', label=f'Median Degree (Food Network): {med_degree_food:.2f}')

# Set log scale for both axes
ax.set_xscale('log')
ax.set_yscale('log')

# Set labels and title
ax.set_xlabel('Degree k (log scaled)')
ax.set_ylabel('Probability Distribution p(k) (log scaled)')
ax.set_title('Degree Distribution the Foods Network compared to a Random Network')
#set lengent to upper right corner
ax.legend(loc='upper right')

ax.grid(True) 

#plt.savefig('images/Degree_Distribution_Plot.png', transparent=True)
plt.show()
No description has been provided for this image

The Foods Network (in green) resembles a power law distribution after degree 10 as the distribution becomes pretty much straight in log-log scale. As mentioned all nodes are forced to start with 10 neighbors because this is what the API call returns. In the cleaning we remove items that are not in the foods category resulting in some nodes getting less than 10, and this is what is apparant in the first part of the distribution. The median degree of the Foods Network is 13 and this is a better sample estimate for the centrality of the population than the average, as this is affected by the heavy tail in the powerlaw. The Foods Network shows a less centrally concentrated and wider spread in degree distribution compared to the Random Network. This suggests a higher variance in connectivity among nodes, with some nodes having significantly higher degrees than the average. This spread could indicate the presence of a few highly popular items and many less connected items.

The Random Network (in red), displays a tighter, more symmetric distribution around the average degree, characteristic of the binomial distribution typical in Erdős-Rényi models[3]

The high-degree nodes in the Foods Network likely represent items that are versatile or have broad appeal, causing them to appear frequently on the "frequently bought together" list across various other items. For example, common ingredients like milk, bread, or eggs might be expected to have higher degrees, but this we will delve more into detail about in the next sections. These high-degree nodes can provide valuable insights into consumer preferences in these grocery stores and can influence marketing strategies, product placements, and inventory decisions.

In [ ]:
print(f"The average degree < k > of the food network is {avg_degree_food}")
print(f"The ln(N)) is {np.log(N)}")
The average degree < k > of the food network is 15.208610968733982
The ln(N)) is 9.185535253057212

Because < k > > ln(N) we are in the Connected Regime. This means that all components are absorbed by the giant component, resulting in a single connected network [3]

3.1.2 Assortativity ¶

Assortativity provides insights into which items tend to be bought together by revealing the tendency of nodes to connect with other nodes that are similar or dissimilar in a defined way. This can help us understand patterns in consumer behavior and product correlations.

In order to compare if the assortativity of the food-network is significant from random we compare it to 100 different random networks, where the assortativity has been broken by a configuration model that uses the double edge swap algorithm to ensure each node retains its original degree but with altered connections.

The configuration model and 100 random networks¶

In [42]:
def shuffle_net(network):
    n_edges = network.number_of_edges()
    new_network = network.copy()
    edges = list(new_network.edges)
    e = len(edges)
    for _ in range(n_edges*10):
        idx1, idx2 = random.randint(0,e-1), random.randint(0,e-1)
        if idx1 != idx2:
            u,v  = edges[idx1]
            x,y = edges[idx2]
            prob = np.random.random()
            if u != y and v != x and u != x and v != y:
                if prob < 0.5:
                    new_network.remove_edge(u,v)
                    new_network.add_edge(v,u)
                if y not in new_network.neighbors(u):
                    if x not in new_network.neighbors(v):
                        new_network.remove_edge(u,v)
                        new_network.remove_edge(x,y)
                        new_network.add_edge(u,y)
                        new_network.add_edge(x,v)
                        edges[idx1] = (u,y)
                        edges[idx2] = (x,v)
    return new_network
In [43]:
print(f"""
      Number of edges in the random network: {shuffle_net(G).number_of_edges()}
      Number of edges in the original network: {G.number_of_edges()}""")
      Number of edges in the random network: 74180
      Number of edges in the original network: 74180
In [ ]:
# Shuffle the food-network 100 times
network_list_100 = Parallel(n_jobs=4)(delayed(shuffle_net)(G) for _ in tqdm(range(100)))

Degree Assortativity¶

In [47]:
degree_value_random = Parallel(n_jobs=4)(delayed(nx.degree_assortativity_coefficient)(G) for G in network_list_100)
In [48]:
degree_value_food = nx.degree_assortativity_coefficient(G)
print(f"The degree assortativity coefficient of the food network is: {degree_value_food}")
print(f"The average degree assortativity coefficient of the random network is: {np.mean(degree_value_random)}")
The degree assortativity coefficient of the food network is: 0.05364441112708626
The average degree assortativity coefficient of the random network is: -0.03372721860486165

Attribute Assortativity¶

The Assorsativity coefficient measures the similarity of connections in the graph with respect given attributes - here we look into:

  1. Category
  2. Ecology
  3. Price
In [49]:
category_assort_random = Parallel(n_jobs=4)(delayed(nx.attribute_assortativity_coefficient)(network, 'category') for network in tqdm(network_list_100))
100%|██████████| 100/100 [00:05<00:00, 19.28it/s]
In [50]:
category_value_food = nx.attribute_assortativity_coefficient(G, "category")
print(f"The category assortativity coefficient of the food network is: {category_value_food}")
print(f"The average category assortativity coefficient of the random network is {np.mean(category_assort_random)}")
The category assortativity coefficient of the food network is: 0.49449134415630075
The average category assortativity coefficient of the random network is -0.0008480463459677784
In [51]:
eco_assort_random = Parallel(n_jobs=4)(delayed(nx.attribute_assortativity_coefficient)(network, 'ecology') for network in tqdm(network_list_100))
100%|██████████| 100/100 [00:05<00:00, 19.00it/s]
In [52]:
eco_value_food=nx.attribute_assortativity_coefficient(G, "ecology")
print(f"The ecology assortativity coefficient of the food network is: {eco_value_food}")
print(f"The average ecology assortativity coefficient of the random network is {np.mean(eco_assort_random)}")
The ecology assortativity coefficient of the food network is: 0.5951001229613563
The average ecology assortativity coefficient of the random network is -0.0009896487542824398
In [ ]:
price_assort_random = Parallel(n_jobs=4)(delayed(nx.numeric_assortativity_coefficient)(network, 'price_amount') for network in tqdm(network_list_100))
In [46]:
price_value_food=nx.numeric_assortativity_coefficient(G, "price_amount")
print(f"The price assortativity coefficient of the food network is: {price_value_food}")
print(f"The average price assortativity coefficient of the random network is {np.mean(price_assort_random)}")
The price assortativity coefficient of the food network is: 0.35536678301562635
The average price assortativity coefficient of the random network is -0.001631050753573886
In [53]:
fig, axes = plt.subplots(2, 2, figsize=(16, 10))  # 2x2 grid of plots
fig.subplots_adjust(hspace=0.3, wspace=0.2)

# Titles for each plot
titles = ['Degree Assortativity', 'Category Assortativity', 'Ecology Assortativity', 'Price Assortativity']

# Data for each plot
data_random = [degree_value_random, category_assort_random, eco_assort_random, price_assort_random]
data_food = [degree_value_food, category_value_food, eco_value_food, price_value_food]

# Plotting each histogram
for ax, title, random, food in zip(axes.flatten(), titles, data_random, data_food):
    ax.hist(random, bins=30, alpha=0.7, label='Random networks')
    ax.axvline(food, color='red', label='Original network')
    ax.set(title=title, xlabel='Assortativity coefficient', ylabel='Frequency')
    ax.legend(loc='upper right')

# Display the plot
plt.show()
No description has been provided for this image

When comparing the significance of the degree assortativity and the attribute assortativities in the food network compared to the random, both metrics offer important insights but on different aspects of the network structure.

When considering the attribute assortativity on the category and the ecological attribute it is apparent that these values are very significant different than the random. This suggest that products within the same categories tend to get purchased together. Additionally eco products tend to get purchased with other eco products as well, which is also what we might expect from knowing certain consumer behaviours in real life. The high price assortativity suggests that expensive products are often purchased together, and similarly for cheaper products. This pattern makes sense, for example, when buying alcoholic beverages or various meats during the same shopping trip. However, it is important to note that our data only includes prices per kilo and per piece, making the result a bit misleading because the price does not reflect what is actually paid when going shopping.

The degree assortativity coefficient of 0.0536, though very modest, is significantly different from the average in the random networks. This suggests that there is a preference for items with similar degrees of connectivity to be purchased together. For example, niche products might frequently be purchased together, or popular items might commonly appear in the same baskets. Although the effect size is smaller than for category assortativity or ecology assortativity, it's still a statistically significant deviation from randomness, indicating non-random structural patterns.

3.1.3 Communities ¶

By identifying communities, we can gain insights into how consumers group certain types of food together. This helps in understanding purchasing patterns and preferences.

When partitioning the network into communities we use the Louvain Community Detection Algorithm built in the networkX function because we want to investigate the underlying and naturally existing structures, and this algorithm finds the non-overlapping communities. We also update the network_df with a community coloumn which is used in the network analysis section.

In [24]:
communities = nx.community.louvain_communities(G,seed=42)
network_df['community'] = None
for idx, community in enumerate(communities):
    for node in community:
        G._node[node]['community']=idx
        network_df.loc[network_df['product_id']==str(node), 'community'] = idx
network_df=network_df.dropna()

We want to get a subgraph containing the 5 largest communities, so that we can plot these to show the magnitude of the amount of products within, as well as label them using the information we get in the NLP section.

In [ ]:
# create a subgraph containing the 5 largest communities
nodes = []
sorted_comms = sorted(communities, key=len, reverse =True)
for community in sorted_comms[:5]:
   nodes.extend(community)
subgraph = G.subgraph(nodes)

community_length = {str(G._node[next(iter(com))]['community']):len(com) for com in sorted_comms[:5]}

# Plot the subgraph
colors = ['cyan','blue','forestgreen','magenta','darkorange','dark_violet']
n_colors = len(colors) 
sorted_communities = dict(sorted(nx.get_node_attributes(subgraph,'community').items(),key=lambda x: x[1],reverse=True))
community_colors = {community: colors[i % n_colors] for i, community in enumerate(set(sorted_communities.values()))}
for (node, commun) in sorted_communities.items():
   subgraph.nodes[node]['color'] = community_colors[commun]

print(f'The top 5 biggest commmunities and their assigned color in the graph visualisation: {community_colors}\n')
print(f'Number of nodes in each community is: {community_length}\n')
print(f'The amount of nodes in our top-5 community subgraph is: {len(subgraph)}')

network, _ = visualize(subgraph)
fig, ax = draw_netwulf(network)
ax.set_title(title)
The top 5 biggest commmunities and their assigned color in the graph visualisation: {3: 'cyan', 16: 'blue', 18: 'forestgreen', 19: 'magenta', 24: 'darkorange'}

Number of nodes in each community is: {'16': 1742, '24': 1367, '18': 948, '19': 899, '3': 595}

The amount of nodes in our top-5 community subgraph is: 5551

image.png

From the data and plot we find that community 16 is the largest, followed by 24 and then 18 etc. In the next section we will investigate the most central nodes in the entire network (all 25 communities) and then use that information in the following NLP section in order to make sense of these communities.

3.1.4 Centrality ¶

We wish to investigate the most central nodes in our network, to find out which products are the most commonly bought during grocery shopping. We will look at degree, closeness and betweenness centrality and investigate whether these overlap.

Degree centrality
Degreee centrality simply tells how many edges each node in the network has. The assumption is that more central nodes work as hubs in the network and thus have a high degree [4], we will look at the 15 nodes with the highest degree in order to see what products work as hubs in the context of grocery shopping.

In [ ]:
degree_centrality = nx.degree_centrality(G)
sorted_degree = sorted(degree_centrality.items(), key=lambda item: item[1], reverse=True)
top15_degree = sorted_degree[:15]
for idx, (node,degree) in enumerate(top15_degree):
    name = G._node[node]['name']
    print(f'The {idx+1}. highest degree product is {name} with {round(degree,4)} degree_centrality')
The 1. highest degree product is agurk with 0.0404 degree_centrality
The 2. highest degree product is agurk with 0.0313 degree_centrality
The 3. highest degree product is skrabeæg m/l with 0.0185 degree_centrality
The 4. highest degree product is peberfrugter røde with 0.0182 degree_centrality
The 5. highest degree product is bananer with 0.017 degree_centrality
The 6. highest degree product is agurk øko with 0.016 degree_centrality
The 7. highest degree product is smørbar with 0.0158 degree_centrality
The 8. highest degree product is æg m/l øko with 0.0132 degree_centrality
The 9. highest degree product is bananer 4 pak øko with 0.0132 degree_centrality
The 10. highest degree product is finvalsede havregryn øko with 0.0132 degree_centrality
The 11. highest degree product is hakkede tomater øko with 0.0124 degree_centrality
The 12. highest degree product is minimælk 0,4% fedt with 0.0111 degree_centrality
The 13. highest degree product is hvedemel øko with 0.01 degree_centrality
The 14. highest degree product is remoulade øko with 0.0096 degree_centrality
The 15. highest degree product is mørk pålægschokolade 53% kakao øko with 0.0095 degree_centrality

The closeness centrality
Next we will look at closeness centrality, which tell us how "close" each node is to the other nodes in the network, meaning more central nodes have lower scores as they do not have to travel as far along the paths to get to other nodes in the network.
We still only consider the top 15 most central nodes, in networkx the centrality score is higher for more central nodes, meaning we consider the highest values in this case.

In [26]:
centrality_of_G = nx.closeness_centrality(G)
sorted_closeness = sorted(centrality_of_G.items(), key=lambda item: item[1], reverse=True)
top15_closeness = sorted_closeness[:15]
top15_closeness
Out[26]:
[(18364, 0.34026372706342006),
 (29439, 0.33631003689273525),
 (61090, 0.32341921151231806),
 (39411, 0.32219065865098767),
 (53366, 0.3211087700816434),
 (19721, 0.3191336212537626),
 (82376, 0.3181551307978342),
 (119482, 0.31776127182694813),
 (72008, 0.3175957280541808),
 (53365, 0.3173683868028893),
 (40165, 0.3154490475728469),
 (51061, 0.3145335526103641),
 (18381, 0.31242793081358106),
 (41860, 0.3096606241467983),
 (39353, 0.3087588237156152)]

It is interesting to see, which communities these node-hubs bridge across, in order to do this, we iterate over each of our top 15 nodes and look at the assigned communities of their respective neighbours. The dictionary below show the set of communities each node bridge across:

In [27]:
bridge_dict_closeness = dict()
for (node, val) in top15_closeness:
    bridge_dict_closeness[str(node)] = set()
    for (self, neighbour_node) in G.edges(node):
        bridge_dict_closeness[str(self)].add(G._node[neighbour_node]['community'])

closeness_hubs = dict()
for node, bridges in bridge_dict_closeness.items():
    closeness_hubs[G._node[int(node)]['name']] = bridges
closeness_hubs
Out[27]:
{'agurk': {3, 4, 10, 11, 14, 16, 18, 19, 21, 22, 24},
 'agurk øko': {4, 11, 16, 18, 20, 24},
 'finvalsede havregryn øko': {11, 16, 18, 24},
 'æg m/l øko': {3, 8, 11, 16, 20, 22, 24},
 'smørbar': {5, 7, 10, 13, 14, 16, 18, 19, 21, 22, 24},
 'hvedemel øko': {10, 14, 16, 18, 22, 24},
 'hakkede tomater øko': {16, 24},
 'bananer 4 pak øko': {4, 11, 16, 17, 18, 24},
 'skrabeæg m/l': {3, 9, 10, 16, 17, 18, 22, 24},
 'rosiner øko': {4, 14, 16, 20, 24},
 'peberfrugter røde': {5, 12, 16, 18, 22, 24},
 'bananer': {4, 6, 11, 12, 16, 17, 18, 21, 24},
 'remoulade øko': {3, 4, 9, 11, 16, 18, 19, 21, 22, 24},
 'mørk pålægschokolade 53% kakao øko': {11, 16, 17, 22, 24}}

Betweenness centrality
This centrality measure tell us which nodes is most often in the shortest path between two arbitrary nodes in the network. Nodes with high betweenness centrality score are usually considered gatekeepers of information and is thus informative to examine in the context of bridging across comunities. [4]

In [ ]:
betweenness_of_G = nx.betweenness_centrality(G)
sorted_betweenness = sorted(betweenness_of_G.items(), key=lambda item: item[1], reverse=True)
top15 = sorted_betweenness[:15]
top15
Out[ ]:
[(18364, 0.029281001959232396),
 (29439, 0.021855598066004425),
 (19721, 0.007246604658721504),
 (39353, 0.006221691778238507),
 (61090, 0.006172700660317658),
 (53366, 0.005927257490907317),
 (39411, 0.005753901560903707),
 (53365, 0.0054668646123460255),
 (41860, 0.005017276796866281),
 (82376, 0.004884098461488741),
 (18381, 0.004677436261855312),
 (119482, 0.004525198261389455),
 (72008, 0.004478263697298725),
 (51061, 0.004296037487222067),
 (71507, 0.004068930069322418)]

Below we print out the communities our betweenness hubs bridge across. The communities are labeled down in the NLP section which grants intuition as to what each community represent in terms of shopping theme.

In [ ]:
bridge_dict = dict()
for (node, val) in top15:
    bridge_dict[str(node)] = set()
    for (self, neighbour_node) in G.edges(node):
        bridge_dict[str(self)].add(G._node[neighbour_node]['community'])
betweenness_hubs = dict()
for node, bridges in bridge_dict.items():
    betweenness_hubs[G._node[int(node)]['name']] = bridges
betweenness_hubs
Out[ ]:
{'agurk': {3, 4, 10, 11, 14, 16, 18, 19, 21, 22, 24},
 'smørbar': {5, 7, 10, 13, 14, 16, 18, 19, 21, 22, 24},
 'mørk pålægschokolade 53% kakao øko': {11, 16, 17, 22, 24},
 'agurk øko': {4, 11, 16, 18, 20, 24},
 'æg m/l øko': {3, 8, 11, 16, 20, 22, 24},
 'finvalsede havregryn øko': {11, 16, 18, 24},
 'skrabeæg m/l': {3, 9, 10, 16, 17, 18, 22, 24},
 'remoulade øko': {3, 4, 9, 11, 16, 18, 19, 21, 22, 24},
 'hvedemel øko': {10, 14, 16, 18, 22, 24},
 'bananer': {4, 6, 11, 12, 16, 17, 18, 21, 24},
 'hakkede tomater øko': {16, 24},
 'bananer 4 pak øko': {4, 11, 16, 17, 18, 24},
 'peberfrugter røde': {5, 12, 16, 18, 22, 24},
 'tonic': {2, 3, 6, 8, 16, 18, 19, 24}}

We find that the two centrality scores are almost identical, with the exception that closeness have "rosiner øko" meaning ecological raisins. Meanwhile betweenness have "tonic". The conclusion is, that the nodes found are very strong hubs that are very central in the network and, as shown above, bridge across a wide variety of communities. The thing remaining is, to clarify what these communities represent, which is shown in the next section.

3.2 Text analysis ¶

The text used for the text analysis were collected from web-scraping description of products of the BilkaToGo website. The descriptions are a small text about a certain procuts typically explaining what the product tastes like, and what occasions/setting they are typically enjoyed in. As mentioned, this project focuses exclusively on categories related to food products. Some products also include descriptions about their brands, which we have excluded in the data collection.

In [15]:
data_clean = pd.read_csv('data/df_clean_data_updated_comm_new1.csv', sep=";")

3.2.1 Tokenization and distribution of text length ¶

When working with text analysis tools it is essential that we tokenize the text, such that we break the text into more managable pieces for NLP applications. This is done with nltk python package that can remove unwanted tokens such as danish stop words, punctuation etc. The distribution of the tokenized text can be seen below. note that we added a stop words "kan"(can) to the stopwords as this is a regular expression in danish. Especially in descriptions of foods that can be used in recipes

In [16]:
def preprocess_text(tokens):
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    porter = PorterStemmer()
    stop_words = set(stopwords.words('danish'))
    stop_words.add("kan")
    
    tokens = [w.lower() for w in tokens if w.isalpha()]
    tokens = [w for w in tokens if not w in stop_words]
    #tokens = [porter.stem(word=t, to_lowercase=False) for t in tokens]
    
    tokens = [token for token in tokens if token.strip()]
    
    
    return tokens
In [17]:
data_clean['Tokens'] = data_clean['descriptions'].apply(word_tokenize).apply(preprocess_text)
token_list = data_clean['Tokens']

data_clean['token_length'] = data_clean['Tokens'].apply(len) #length
In [18]:
#plot the length of the descriptions for the products
plt.figure(figsize=(6,4))
sns.kdeplot(data_clean['token_length'])
plt.title('Length of descriptions')
plt.show()
/opt/anaconda3/envs/comsocsci2024/lib/python3.10/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):
No description has been provided for this image

We note that it still has the same shape as a normal distribution as seen in the data-cleaning. This distribution of tokenized descriptions has a mean of 33.24 a standard deviation of 10.11 and a median of 32. Which means that with tokenizing the text, the mean of the desription length decreased by ~25 words.

In [82]:
data_clean['token_length'].describe()
Out[82]:
count    9755.000000
mean       33.237827
std        10.106843
min        13.000000
25%        27.000000
50%        32.000000
75%        38.000000
max       196.000000
Name: token_length, dtype: float64

3.2.2 Frequency-Rank-Plot ¶

We now go further into the text analysis. We will start by analysing our corpuses of product-descriptions for Zipf's law of abbreviation. This linguisic law states that the value of the n'th entry is inversly proportinal to n when token frequency is sorted in a list of decreasing order. This essentially means that the most common token in our corpus should occur twice as often as the next common one, three times as often as the third most common one and so on. We will check if this is the case in our corpuses by plotting the frequency of each token with the ideal zipf's law to compare:

In [83]:
comprehensive_token_list = list(chain(*[token_list[idx] for idx in token_list.index]))
In [84]:
amount_of_words = len(Counter(comprehensive_token_list).most_common())
word_count_list = Counter(comprehensive_token_list).most_common()
word_count = [word_count_list[i][1] for i in tqdm(range(amount_of_words))]
100%|██████████| 19389/19389 [00:00<00:00, 3877896.15it/s]
In [85]:
def plot_frequency_rank(word_count, amount_of_words):
    # Get ranks and frequencies
    ranks = list(range(1, amount_of_words + 1))
    frequencies = word_count
    
    # Plot
    plt.figure(figsize=(10, 6))
    plt.plot(ranks, frequencies, marker='o', linestyle='-', label='Word Frequencies')
    
    # Calculate Zipf's law line
    constant = frequencies[0]  # Constant is the frequency of the most frequent word
    zipf_line = [constant / rank for rank in ranks]
    plt.plot(ranks, zipf_line, linestyle='--', color='red', label="Zipf's Law")
    
    plt.title('Frequency-Rank Plot')
    plt.xlabel('Rank')
    plt.ylabel('Frequency')
    plt.xscale('log')  # Use logarithmic scale for better visualization
    plt.yscale('log')
    plt.grid(True)
    plt.legend()
    plt.show()
In [86]:
plot_frequency_rank(word_count, amount_of_words)
No description has been provided for this image

We see that the frequency-rank plot follows somewhat the ideal Zipf's law but there are deviations. There could be a couple of reasons for this. Firstly, Zipf's law is a statistically law that describes the average behaviour of words in a large corpus. When we look at a smaller corpus with a specific topic, like food in our case, there will be natural variations around the expected rank-frequency distribution. Furthermore, like stated our corpus has a specific topic of descriptions of foods. This means that there will be a specific vocabulary used related to taste, texture, ingredient etc. additionally, given that these are descriptions there might be a tendancy to use a richer variety of adjectives and less common terms than general danish text

3.2.3 TF-IDF scores and Wordclouds ¶

With this approach, we aim to gain insights into consumer buying patterns by analyzing the words that characterize certain natural communities within the frequently-bought-together network. This analysis can reveal what makes products land in the same basket on a shopping trip. In order to do this, we use the TF-IDF scores on the tokenized product descriptions. These TF-IDF scores of words help us identify what words specifically are unique for each community. The TF-IDF scores of words for the top 9 communities can be seen below:

In [88]:
# Define a function to extract tokens and concatenate them for each community
def concatenate_tokens(group):
    tokens_list = group['Tokens'].explode().tolist() #should be list of strings
    return tokens_list


# Group by community and apply the function to concatenate tokens
community_tokens = data_clean.groupby('community').apply(concatenate_tokens)


# Reset index to convert the groupby result back to a DataFrame
community_tokens = community_tokens.reset_index(name='Tokens')
In [89]:
#print some insights
for i in range(max(data_clean["community"])+1):
    c = community_tokens['community'][i]
    print(f"for community {c} we have the following number of tokens: {len(community_tokens['Tokens'][i])}")
for community 0 we have the following number of tokens: 5781
for community 1 we have the following number of tokens: 7333
for community 2 we have the following number of tokens: 2504
for community 3 we have the following number of tokens: 18998
for community 4 we have the following number of tokens: 8614
for community 5 we have the following number of tokens: 2230
for community 6 we have the following number of tokens: 6692
for community 7 we have the following number of tokens: 3760
for community 8 we have the following number of tokens: 10238
for community 9 we have the following number of tokens: 11936
for community 10 we have the following number of tokens: 5999
for community 11 we have the following number of tokens: 7164
for community 12 we have the following number of tokens: 7867
for community 13 we have the following number of tokens: 2839
for community 14 we have the following number of tokens: 8649
for community 15 we have the following number of tokens: 4817
for community 16 we have the following number of tokens: 58447
for community 17 we have the following number of tokens: 9347
for community 18 we have the following number of tokens: 37450
for community 19 we have the following number of tokens: 27230
for community 20 we have the following number of tokens: 8311
for community 21 we have the following number of tokens: 6154
for community 22 we have the following number of tokens: 13202
for community 23 we have the following number of tokens: 1862
for community 24 we have the following number of tokens: 46811
In [90]:
Top_9_communities = list(data_clean['community'].value_counts().nlargest(9).index)
TF_IDF = {}
data_product = pd.read_csv('data/df_clean_data_updated_comm_new1.csv', sep=";")
TF = {}
top_products = {}

for c in Top_9_communities:
    #Calculate token frequency within a community.
    products = list(data_clean.loc[data_clean["community"] == c, "product_id"])

    token_doc_frequency = Counter()
    token_term_frequency = Counter()
    for idx, p in enumerate(products):
        tokens = list(data_clean.loc[data_clean["product_id"] == p, "Tokens"])[0]
        token_doc_frequency.update(set(tokens))
        token_term_frequency.update(tokens)
        

    # Calculate IDF and TF-IDF scores for the top 9 largest communities
    TF_IDF[c] = {}
    TF[c] = {}
    for token, freq  in token_term_frequency.items():
        TF[c][token] = freq / token_term_frequency.total()
    
    for token, freq  in token_doc_frequency.items():
        idf = np.log(idx / freq)
        TF_IDF[c][token] = TF[c][token] * idf
        
    
    # Top 10 TF-IDF words in the community
    top_TF_IDF = sorted(TF_IDF[c].items(), key=lambda x: x[1], reverse=True)[:20]
    print("-----------------------------------------------------------------------")
    print("Community", c, "top 10 TF-IDF words:")
    print("-----------------------------------------------------------------------")
    for token, score in top_TF_IDF:
        print(token, score)
    

    #Top 3 products within communities
    authors = data_clean.loc[data_clean['community']==c].sort_values(by=['degree'], ascending = False)
    
    top_products[c] = authors.product_id.unique()[:3]
    print("-----------------------------------------------------------------------")
    print(f'Top 3 products in this community {c} is :')
    print("-----------------------------------------------------------------------")
    print([list(data_clean.loc[data_clean['product_id']==top_products[c][i]].name)[0] for i in range(3)])
    print('\n')
    
    
-----------------------------------------------------------------------
Community 16 top 10 TF-IDF words:
-----------------------------------------------------------------------
lækker 0.012585886956518667
salling 0.012187675804105597
smag 0.011927085469768456
prøv 0.011018183475615149
nyd 0.010610860296630572
frisk 0.010316309193861157
brug 0.009946364959113625
ved 0.009709586242448536
lidt 0.00968964292960291
sammen 0.00961125401293152
giver 0.009561492721698199
velsmagende 0.00943554275364658
både 0.009364547413562975
god 0.00929300163764502
eksempel 0.009066069077126348
cremet 0.00890956557797443
let 0.008892040524598308
perfekt 0.008721939188393495
så 0.008583856430754504
sød 0.008545096795541163
-----------------------------------------------------------------------
Top 3 products in this community 16 is :
-----------------------------------------------------------------------
['agurk', 'agurk', 'skrabeæg m/l']


-----------------------------------------------------------------------
Community 24 top 10 TF-IDF words:
-----------------------------------------------------------------------
økologisk 0.011928131509043007
salling 0.011265402716786879
smag 0.011240240223448014
økologiske 0.011171557976733464
kaffe 0.011046862004062045
brug 0.011036072869396537
prøv 0.010943716510158208
øko 0.010804717759182993
retter 0.010642736782579279
lækker 0.010634864768113981
lidt 0.009829783368724268
eksempel 0.009631233440533574
nyd 0.009398780535264818
ved 0.009093336861002523
frisk 0.009030315586437495
giver 0.008948686437036408
god 0.008924430044041396
let 0.00855599027158097
sød 0.008489551092835002
skøn 0.008413821203677018
-----------------------------------------------------------------------
Top 3 products in this community 24 is :
-----------------------------------------------------------------------
['finvalsede havregryn øko', 'æg m/l øko', 'hakkede tomater øko']


-----------------------------------------------------------------------
Community 18 top 10 TF-IDF words:
-----------------------------------------------------------------------
gin 0.012865262647088274
vinen 0.01149468966167348
noter 0.01048504249730304
smag 0.010444257629895613
ved 0.010331652978745224
frisk 0.00984605041416286
vin 0.009578374600664465
nyd 0.009252567173331365
smagen 0.009191086821773305
farve 0.009152088310717277
giver 0.008486688823275433
forfriskende 0.008157518262547956
sammen 0.008100056914931746
let 0.00801736865480589
samt 0.007917151546620817
tonic 0.007853311878635423
duft 0.007815940495977792
dejlig 0.007716717330250669
rom 0.007654079524321919
fyldig 0.00758860495766667
-----------------------------------------------------------------------
Top 3 products in this community 18 is :
-----------------------------------------------------------------------
['tonic', 'london dry gin', 'indian tonic water']


-----------------------------------------------------------------------
Community 19 top 10 TF-IDF words:
-----------------------------------------------------------------------
chokolade 0.01538117226694309
lakrids 0.014813989966841665
mix 0.014122979106027379
smag 0.01399275460286946
mælkechokolade 0.013137239411703782
lækker 0.012575189717221782
søde 0.012474771627895793
nyd 0.01215971550793579
lækre 0.011988702890385528
små 0.011941419645255288
toms 0.011575706947044672
sammen 0.011194132412971465
både 0.011062807958070601
sød 0.011007453862944818
karamel 0.010561590239859957
så 0.010454766442821858
giver 0.010372017555838846
forskellige 0.010237941492983264
blanding 0.010052304527019497
m 0.010023541583593
-----------------------------------------------------------------------
Top 3 products in this community 19 is :
-----------------------------------------------------------------------
['lakridsbolcher', 'tappsy', 'chocofant']


-----------------------------------------------------------------------
Community 3 top 10 TF-IDF words:
-----------------------------------------------------------------------
oliven 0.016204153727310776
lækker 0.013984140151505243
salling 0.012265731949859941
ost 0.011820259408087862
nyd 0.011490311473949646
prøv 0.011234482179868208
smag 0.01101439250965244
cremet 0.010630566611902599
sammen 0.010587992727412258
del 0.010433219416576905
giver 0.010339563303779862
krydret 0.010316244061016997
salat 0.010206443439937941
ekstra 0.010090226708647987
brød 0.010038476365819752
velsmagende 0.009995220934863776
osten 0.009944320996431184
sandwich 0.009824894152463627
dine 0.009797612736985411
let 0.009706981538476359
-----------------------------------------------------------------------
Top 3 products in this community 3 is :
-----------------------------------------------------------------------
['fyldte røde peberfrugter m. flødeost', 'lufttørret italiensk bresaola i skiver', 'serranoskinke, salchichon og chorizo']


-----------------------------------------------------------------------
Community 22 top 10 TF-IDF words:
-----------------------------------------------------------------------
smag 0.013964041011196198
desserter 0.012607087574579847
lækker 0.012567518029508257
chokolade 0.012406813341815653
kager 0.012275228576760795
brug 0.011496200653748906
lave 0.011174053543536294
dine 0.011111210870453683
is 0.010565351238147582
salling 0.010553063993362298
prøv 0.010207588604035396
bruges 0.010162916437048027
vand 0.010071413183356238
boller 0.010044484244460838
bagværk 0.009903733774073843
både 0.009586076821697166
pynt 0.009538409546711365
oetker 0.00951185576164966
velegnet 0.00944221515195574
sød 0.00942818785525433
-----------------------------------------------------------------------
Top 3 products in this community 22 is :
-----------------------------------------------------------------------
['husblas', 'flormelis', 'sukker']


-----------------------------------------------------------------------
Community 9 top 10 TF-IDF words:
-----------------------------------------------------------------------
kaffe 0.01337116772642675
kondi 0.012628256464222914
faxe 0.012628256464222914
energidrik 0.011943202967085552
giver 0.011894088927269529
lækker 0.011805188700472882
frisk 0.011696900761424345
forfriskende 0.01166801922515927
mælk 0.010905484655665596
nyd 0.01053128428127807
farten 0.010515531564520167
forfriskning 0.010015192337136946
trænger 0.009451138053689396
ml 0.009418541259578922
velsmagende 0.009395364327394611
skøn 0.009285253754348776
sød 0.009238774921938035
iskaffe 0.009232810124623897
dagen 0.009048550295121062
vand 0.009027330056963205
-----------------------------------------------------------------------
Top 3 products in this community 9 is :
-----------------------------------------------------------------------
['tyggegummi m. frugtsmag', 'coca cola', 'sportssodavand sukkerfri']


-----------------------------------------------------------------------
Community 8 top 10 TF-IDF words:
-----------------------------------------------------------------------
lækker 0.012874166339660077
snack 0.012529077964163608
smag 0.012415276897778781
nyd 0.011868033412966257
vegansk 0.011327680399168252
skøn 0.011163738924474906
velsmagende 0.011012877587657677
minutter 0.010988265936565284
økologisk 0.01088739310935951
brug 0.010572550665090072
plantebaseret 0.010326026401666074
dejlig 0.01004392313183557
lavet 0.009691504776332566
lille 0.009678273719513415
giver 0.009385683714965095
perfekt 0.009197970040665793
cremet 0.00910896350071851
farten 0.009034037854516008
både 0.009010256366366492
chokolade 0.008855843265356446
-----------------------------------------------------------------------
Top 3 products in this community 8 is :
-----------------------------------------------------------------------
['smørepålæg vegansk', 'vegetarpålæg i skiver m. peber', 'creme fraiche dressing']


-----------------------------------------------------------------------
Community 17 top 10 TF-IDF words:
-----------------------------------------------------------------------
retten 0.015215713341744954
lækker 0.013997079768259625
salaten 0.013336375338590426
smag 0.01324437168160662
salling 0.012724314704983026
velsmagende 0.012041726369209341
nyd 0.012018938337123902
giver 0.01199595987845716
frisk 0.01173427944110398
sammen 0.01164876135365464
marmelade 0.011401070792561203
sød 0.011230645561764133
salat 0.011053527158199603
skøn 0.010878360425413989
kylling 0.010556798282599611
prøv 0.010105573142152562
lidt 0.009840610794337908
nem 0.00978518542036373
nemt 0.009589481711956454
perfekt 0.009566815919345996
-----------------------------------------------------------------------
Top 3 products in this community 17 is :
-----------------------------------------------------------------------
['appelsinjuice fra koncentrat', 'solbærmarmelade', 'appelsinmarmelade']


We see some distinct patterns that that differentiate each community. For instance, community 24 is clearly associated with ecological products, while community 19 appears to be centered around candy, with prominent words like 'mix,' 'liquorice,' and 'milk-chocolate.' To gain a better understanding of the defining characteristics of each community, we will examine the top TF-IDF words in word clouds for the top 9 communities.

In [98]:
data_product = pd.read_csv('data/df_clean_data_updated_comm_new1.csv', sep=";")

top_tfidf_words = {}
Top_com_products = {}
for c in Top_9_communities:
    Top_com_products[c] = [list(data_clean.loc[data_clean['product_id']==top_products[c][i]].name)[0] for i in range(3)]
    top_tfidf_words[c] = [word[0] for word in sorted(TF_IDF[c].items(), key=lambda x: x[1], reverse=True)] 
    

# Create a 3x3 grid of subplots
fig, axs = plt.subplots(3, 3, figsize=(16, 16))

# Iterate through each subplot position and community 
for (i, j), c in zip(product(range(3), repeat=2), Top_9_communities):
    data = " ".join(top_tfidf_words[c])
    wordcloud = WordCloud(width=400, height=400, background_color='white').generate(data)
    wordcloud.generate_from_frequencies(TF_IDF[c])
   
    axs[i, j].imshow(wordcloud, interpolation='bilinear')
    axs[i, j].set_title(f'Community {c}.\n Top products : \n {Top_com_products[c][0]} \n {Top_com_products[c][1]} \n {Top_com_products[c][2]}',fontsize = 12)
    axs[i, j].axis('off')

# Adjust layout to prevent overlap
plt.subplots_adjust(left=0.05, right=0.95, top=0.95, bottom=0.05, wspace=0.2, hspace=0.3)

plt.show()
No description has been provided for this image

Here, we can distinguish between the top 9 communities based on the words specific to each community. This enables us to pinpoint distinct buying patterns for customers visiting a Salling store. To further help with the distinguishing of each community we made an API call for chatGPT 3.5 to classify each community based on the top 100 TF-IDF words for that specific community (see specifics in the next section). The classifications for each community are:

  1. Community 16: 'Everyday'

This community's top TF-IDF words include: 'Salling', 'Delicious', 'taste', 'enjoy', 'together'. This makes sense that this is the largest community within the frequently-bought-together network, as one could imagine that consumers often buy groceries in bulk for weekly meal plans. According to this study [5] at least 57% of families plan their meals weekly (across the US). This wordcloud indicates that the majority of consumers purchase a diverse range of foods for everyday shopping, which is why this community is the largest. The products with the highest degrees within this community are two different brands of (agurk) cucumbers, and normal eggs (skrabeæg), which makes sense as these products are used in a lot of different dishes.

  1. Community 24: 'Organic'

This community's top TF-IDF words include: 'Salling', 'Ecological', 'Taste', 'Coffee'. This community represents a lot of what community 16 does of everyday bought items, but include ecological items. It aligns with the idea that consumers buying for meal plans as explained in community 16 but here organic options are chosen consistently. People who purchase some organic items tend to buy all organic items together, as supported ealier by our assortativity analysis. The products with the highest degrees in this community are eco oatmeal, eco eggs, and eco tomatoes.

  1. Community 18: 'Beverages'

This community's top TF-IDF words include: 'Gin', 'Wine', 'notes', 'taste'. This community describes consumers buying beverages, possibly for social gatherings, as indicated by the word 'Together', suggesting that these drinks are meant to be enjoyed with company. This community is also large because when buying drinks consumers tend to buy things together because drinks are a combination of different products mixed together. Hence, why the word 'Gin' is the largest in the wordcloud, as Gin & Tonic is a very common drink especially in Denmark. This is further shown by the top 3 products with higest degree in this community which is two types of tonic water and London dry gin.

  1. Community 19: 'Indulgence'

This community's top TF-IDF words include: 'liquorice', 'Mix', 'Milkchocolate', 'Chocolate'. As the fourth largest community in the network, it makes sense because stores often place candy and other indulgent treats together. The tendency with the urge of eating and purchasing various sugary snacks for movie nights or parties contributes to the size of this community. The top products in this community are 'liquorice hard candy', 'tappsy', and 'chocofants'.

  1. Community 3: 'Gourmet' / 'Tapas'

This community was classified by ChatGPT as 'Gourmet' but we thought that 'Tapas' was a more fitting category. This community's top TF-IDF words include: 'Olives', 'Cheese', 'Taste' and 'delicious'. Although this community is smaller and more niche, it represents items commonly bought together for tapas. The top frequently bought together items in this community are 'Bell pepper', 'Italian bresaola', and 'Italian serrano ham'.

  1. Community 22: 'Baking'

This community's top TF-IDF words include: 'Desert', 'Cake', 'Oetker' and 'Delicious'. This community would be associated with a buying patterns of a consumer that wants to bake.'Oetker' is a well-known brand for baking products, reinforcing this association. Baking often requires specific ingredients for recipes, which are typically bought together. This conclusion is further enforced when looking at the top products in this community which is 'powdered sugar', 'gelatin' and 'sugar'.

  1. Community 9: 'Energy'

This community's top TF-IDF words include: 'Energydrink', 'Faxe kondi', Rerfreshing'. It is characterized by the need for refreshments or energy-boosting drinks like energy drinks or sodas. A reason why this community is rather large might include the hypothesis that consumers typically buys refreshments to multiple people, such as guests, who have different preferences. However, a reason that it is not as large of a community might be because sometimes people visit the supermarket to buy just a single refreshment and not many other products. The top products in this community include: 'Gum with fruit flavor', 'Coca Cola', and 'Sportsdrink'

  1. Community 8: 'Vegan'

This community's top TF-IDF words include: 'Vegan', 'Delicious', 'Ecology' and 'Enjoy'. This community is distinguished by the vegan aspect. People who are vegan tend to buy other vegan and organic groceries together. A reason for the relative smaller size of this community might be due to the limited selection of vegan foods available in grocery stores and the tendency of vegan consumers to buy exclusively vegan products. The top products in this community are: 'vegan butter', 'vegan toppings' and 'creme fraiche dressing'.

  1. Community 17: 'Healthy'

This community's top TF-IDF words include: 'Salad', 'The dish', 'Enjoy'. This community is classified as a healthy community reflecting consumers who buy nutritious products like salads and healthy juices.The word 'innocent'—a brand known for healthy juices—also appears frequently. The top purchased products in this community are 'orange juice', and 'blackcurrants' and 'orange marmalade'. This suggests a pattern of buying items that contribute to a health-conscious diet.

3.2.4 Labeling of communities using OpenAI's API ¶

We set up a loop classifying each community based on the top 100 TF-IDF words with OpenAI's ChatGPT using their API to call the gpt-3.5-turbo model. with this we get a classification for each community in our network which can be seen below:

In [97]:
Communities = list(data_clean['community'].value_counts().index)
TF_IDF = {}
data_product = pd.read_csv('data/df_clean_data_updated_comm_new1.csv', sep=";")
TF = {}


for c in Communities:
    #Calculate token frequency within a community.
    Works = list(data_clean.loc[data_clean["community"] == c, "product_id"])

    token_doc_frequency = Counter()
    token_term_frequency = Counter()
    for idx, work in enumerate(Works):
        tokens = list(data_clean.loc[data_clean["product_id"] == work, "Tokens"])[0]
        token_doc_frequency.update(set(tokens))
        token_term_frequency.update(tokens)
        

    # Calculate IDF and TF-IDF scores for the top 9 largest communities
    TF_IDF[c] = {}
    TF[c] = {}
    for token, freq  in token_term_frequency.items():
        TF[c][token] = freq / token_term_frequency.total()
    
    for token, freq  in token_doc_frequency.items():
        idf = np.log(idx / freq)
        TF_IDF[c][token] = TF[c][token] * idf
        
        
top_tfidf_words = {}
for c in Communities:
    top_tfidf_words[c] = [word[0] for word in sorted(TF_IDF[c].items(), key=lambda x: x[1], reverse=True)] 

We chose to use GPT-3.5 for this task because it is a highly capable model for such low-level tasks and is cost-effective. We ran a simple prompt with instructions to use a single word to classify each community, and we use the top 100 words because of token limits and at some point the TF-IDF score for these word become too low. We realize that chatGPT does not have a way of setting a seed, so the output could vary when running this code again. To mitigate this we put the 'temperature' variable that controls creativity in the llm to 0. However, this still didn't give consistent results. To avoid spending excessive time on this part of the project, we used the most consistent community mappings from multiple outputs.

{16: 'Everyday', 24: 'Organic', 18: 'Beverages', 19: 'Indulgence', 3: 'Gourmet', 22: 'Baking', 9: 'Energy', 8: 'Vegan', 17: 'Organic', 4: 'Healthy', 14: 'Italian', 12: 'Asian', 20: 'Cooking', 11: 'Organic', 6: 'Snacks', 1: 'Beer', 10: 'Snacks', 0: 'Convenience', 21: 'Mexican', 15: 'Tea', 7: 'Indulgence', 13: 'Meat', 2: 'Coffee', 5: 'Everyday', 23: 'Coffee'}

In [94]:
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
pd.set_option('display.max_colwidth', None)

classify_dict = {}

for c in Communities:
    data = " ".join(top_tfidf_words[c][:100])

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role" : "system",
                "content" : f"You are a NLP and consumer buying pattern expert for groceries and give us a single word category the consumer based on some text about products."
                
            },
            
            {
                "role": "user",
                "content": f"Use a single word. The word should be general to what u think the consumer is buying for. e.g. 'Party', 'Guests', 'Everyday' classify the products based on these descriptions of the products: {data}."               
            },
            
            {
            "role": "assistant",
                "content": f"Ecology-based"
            },
            
        ],
            temperature=0,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0
        )

    output = response.choices[0].message.content.strip()
    
    classify_dict[c] = output
classify_dict
Out[94]:
{16: 'Everyday',
 24: 'Organic',
 18: 'Beverages',
 19: 'Indulgence',
 3: 'Entertaining',
 22: 'Baking',
 9: 'Energy',
 8: 'Vegan',
 17: 'Gourmet',
 4: 'Indulgence',
 14: 'Cooking',
 12: 'Asian',
 20: 'Spices',
 11: 'Organic',
 6: 'Ecology-based',
 1: 'Beer',
 10: 'Snacks',
 0: 'Meal prep',
 21: 'Mexican',
 15: 'Tea',
 7: 'Indulgence',
 13: 'Meat',
 2: 'Coffee',
 5: 'Everyday',
 23: 'Coffee'}
In [95]:
data_product = pd.read_csv('data/df_clean_data_updated_comm_new1.csv', sep=";")

chosen_output_dict = {16: 'Everyday', 24: 'Organic', 18: 'Beverages', 19: 'Indulgence', 3: 'Gourmet', 22: 'Baking', 9: 'Energy', 8: 'Vegan', 17: 'Organic', 4: 'Healthy', 14: 'Italian', 12: 'Asian', 20: 'Cooking', 11: 'Organic', 6: 'Snacks', 1: 'Beer', 10: 'Snacks', 0: 'Convenience', 21: 'Mexican', 15: 'Tea', 7: 'Indulgence', 13: 'Meat', 2: 'Coffee', 5: 'Everyday', 23: 'Coffee'}


top_tfidf_words = {}
for c in Communities:
    top_tfidf_words[c] = [word[0] for word in sorted(TF_IDF[c].items(), key=lambda x: x[1], reverse=True)] 
    

fig, axs = plt.subplots(5, 5, figsize=(16, 16))

for (i, j), c in zip(product(range(5), repeat=2), Communities):
    data = " ".join(top_tfidf_words[c])
    wordcloud = WordCloud(width=400, height=400, background_color='white').generate(data)
    wordcloud.generate_from_frequencies(TF_IDF[c])
   
    axs[i, j].imshow(wordcloud, interpolation='bilinear')
    axs[i, j].set_title(f'Community {c}.\n varying ChatGPT label: {classify_dict[c]} \n chosen GPT label: {chosen_output_dict[c]}', fontsize=12)
    axs[i, j].axis('off')

# Adjust layout to prevent overlap
plt.subplots_adjust(left=0.05, right=0.95, top=0.95, bottom=0.05, wspace=0.2, hspace=0.3)

plt.show()
No description has been provided for this image

This approach provided an unbiased label for each community based on the top 100 words, a task that would be challenging and longsome for a human to do. Additionally, this method allowed us to quickly identify products that link communities, and revisiting what was done in the network analysis about betweenness centrality, we now see the following results:

In [96]:
dictionary1 = {16: 'Everyday', 24: 'Organic', 18: 'Beverages', 19: 'Indulgence', 3: 'Gourmet', 22: 'Baking', 9: 'Energy', 8: 'Vegan', 17: 'Organic', 4: 'Healthy', 14: 'Italian', 12: 'Asian', 20: 'Cooking', 11: 'Organic', 6: 'Snacks', 1: 'Beer', 10: 'Snacks', 0: 'Convenience', 21: 'Mexican', 15: 'Tea', 7: 'Indulgence', 13: 'Meat', 2: 'Coffee', 5: 'Everyday', 23: 'Coffee'}

c_dict = {'18364': {5, 10, 11, 16, 18, 20, 21, 22, 24}, '29439': {3, 4, 10, 11, 14, 16, 18, 19, 21, 22, 24}, '19721': {5, 7, 10, 13, 14, 16, 18, 19, 21, 22, 24}, '39353': {11, 16, 17, 22, 24}, '61090': {4, 11, 16, 18, 20, 24}, '53366': {3, 8, 11, 16, 20, 22, 24}, '39411': {11, 16, 18, 24}, '53365': {3, 9, 10, 16, 17, 18, 22, 24}, '41860': {3, 4, 9, 11, 16, 18, 19, 21, 22, 24}, '82376': {10, 14, 16, 18, 22, 24}, '18381': {4, 6, 11, 12, 16, 17, 18, 21, 24}, '119482': {16, 24}, '72008': {4, 11, 16, 17, 18, 24}, '51061': {5, 12, 16, 18, 22, 24}, '71507': {2, 3, 6, 8, 16, 18, 19, 24}}

# Replace integers with corresponding values from dictionary1 while maintaining order
c_dict_with_strings = {key: [dictionary1[item] for item in sorted(value)] for key, value in c_dict.items()}
new_dict_with_names = {data_clean[data_clean['product_id'] == int(key)]['name'].iloc[0]: value for key, value in c_dict_with_strings.items()}

print(new_dict_with_names)

# Convert dictionary to DataFrame
df = pd.DataFrame.from_dict(new_dict_with_names, orient='index')
df = df.map(lambda x: '' if x is None else x)

df
{'agurk': ['Gourmet', 'Healthy', 'Snacks', 'Organic', 'Italian', 'Everyday', 'Beverages', 'Indulgence', 'Mexican', 'Baking', 'Organic'], 'smørbar': ['Everyday', 'Indulgence', 'Snacks', 'Meat', 'Italian', 'Everyday', 'Beverages', 'Indulgence', 'Mexican', 'Baking', 'Organic'], 'mørk pålægschokolade 53% kakao øko': ['Organic', 'Everyday', 'Organic', 'Baking', 'Organic'], 'agurk øko': ['Healthy', 'Organic', 'Everyday', 'Beverages', 'Cooking', 'Organic'], 'æg m/l øko': ['Gourmet', 'Vegan', 'Organic', 'Everyday', 'Cooking', 'Baking', 'Organic'], 'finvalsede havregryn øko': ['Organic', 'Everyday', 'Beverages', 'Organic'], 'skrabeæg m/l': ['Gourmet', 'Energy', 'Snacks', 'Everyday', 'Organic', 'Beverages', 'Baking', 'Organic'], 'remoulade øko': ['Gourmet', 'Healthy', 'Energy', 'Organic', 'Everyday', 'Beverages', 'Indulgence', 'Mexican', 'Baking', 'Organic'], 'hvedemel øko': ['Snacks', 'Italian', 'Everyday', 'Beverages', 'Baking', 'Organic'], 'bananer': ['Healthy', 'Snacks', 'Organic', 'Asian', 'Everyday', 'Organic', 'Beverages', 'Mexican', 'Organic'], 'hakkede tomater øko': ['Everyday', 'Organic'], 'bananer 4 pak øko': ['Healthy', 'Organic', 'Everyday', 'Organic', 'Beverages', 'Organic'], 'peberfrugter røde': ['Everyday', 'Asian', 'Everyday', 'Beverages', 'Baking', 'Organic'], 'tonic': ['Coffee', 'Gourmet', 'Snacks', 'Vegan', 'Everyday', 'Beverages', 'Indulgence', 'Organic']}
Out[96]:
0 1 2 3 4 5 6 7 8 9 10
agurk Gourmet Healthy Snacks Organic Italian Everyday Beverages Indulgence Mexican Baking Organic
smørbar Everyday Indulgence Snacks Meat Italian Everyday Beverages Indulgence Mexican Baking Organic
mørk pålægschokolade 53% kakao øko Organic Everyday Organic Baking Organic
agurk øko Healthy Organic Everyday Beverages Cooking Organic
æg m/l øko Gourmet Vegan Organic Everyday Cooking Baking Organic
finvalsede havregryn øko Organic Everyday Beverages Organic
skrabeæg m/l Gourmet Energy Snacks Everyday Organic Beverages Baking Organic
remoulade øko Gourmet Healthy Energy Organic Everyday Beverages Indulgence Mexican Baking Organic
hvedemel øko Snacks Italian Everyday Beverages Baking Organic
bananer Healthy Snacks Organic Asian Everyday Organic Beverages Mexican Organic
hakkede tomater øko Everyday Organic
bananer 4 pak øko Healthy Organic Everyday Organic Beverages Organic
peberfrugter røde Everyday Asian Everyday Beverages Baking Organic
tonic Coffee Gourmet Snacks Vegan Everyday Beverages Indulgence Organic

The dataframe is sorted such that the products on the left, are ordered from highest betweenness centrality to lowest of the top 15 highest betweenness centrality in the entire frequently-bought-together network. A higher betweenness centrality means that this products is likely to be a bridge between different grocery shopping categories. This means that these products are likely for a consumer to bought toether with a varity of goods. A reason why the cucumber ranks highest could be its versatility. It is used in a variety of dishes, can be a snack, and is often one of the first items customers see when entering a supermarket. Additionally, its affordability makes it more likely to be added to a basket with a wide range of other products. All this could explain why it is more likely to be picked up into a basket with a lot of different stuff.

Based on the dataframe, we can see that several products have high betweenness centrality, meaning they are often bought together with products from other categories. These products include:

  • Cucumbers (agurk)
  • Butter (smørbar)
  • Organic cucumbers (agurk øko)
  • Organic eggs (æg m/l øko)
  • Grated oatmeal (finvalsede havregryn øko)
  • Remoulade (remoulade øko)
  • Wheat flour (hvedemel øko)

These products could be considered "bridge" products that connect different shopping categories. For example, organic cucumbers (agurk øko) appear in the "Healthy", "Organic", "Everyday", and "Beverages" communities. This suggests that organic cucumbers are often bought together with a variety of other products, such as healthy snacks, other organic items, everyday staples, and beverages.

Overall, the dataframe provides insights into the relationships between different product categories. By identifying products with high betweenness centrality, we can better understand how customers navigate through different categories when shopping. This information can be valuable for retailers in product placement, marketing, and inventory management.

4. Discussion. ¶

Through this project, we have gained significant insights into consumer behavior and purchasing patterns within the context of Danish grocery stores. Collecting the relevant data was a challenging task that required the use of new scraping tools and considerable effort. However, the process provided valuable learning opportunities and enhanced our understanding of webscraping.

In the analysis we learned from the assortativity scores and by inspecting the wordclouds for the communities that organic or vegan products often gets purchased with other organic or vegan products. The same goes for other "typical" combinations such as alcohol with soda / tonic and jam with juice (breakfast). The same can be said for items within the same category. We found that candy is usually bought with more candy, while things used in baking such as gelatine and powdered sugar is the top products in community 22 which was labelled 'baking'. It is notable that the category assortativity is high, which is understandable considering the argument that items that physically lie together in the store is bought together. Additionally, the proximity of items can make it tempting to purchase additional products near those one has already selected.

Another thing which surpriced us is the supremacy of the cucumber. We found that on across all parameters it is the most central item. The cucumber has the highest degree and is the most central node in all centrality measures. This aguably makes sense given its versatility as mentioned ealier that it is used in a variety of dishes, can be a snack, and is often one of the first items customers see when entering a supermarket. Additionally, its affordability makes it more likely to be added to a basket with a wide range of other products. But initially we wouldn't have guessed that this specific item would be the most interconnected.

Some things that we would like to improve, given more time, would be to expand on the centrality analysis regarding the community bridging of our top scoring nodes. Currently the node have a bridge if it just connects to one other node in another community. We would like to expand on this and maybe take the community-average of connections between the chosen node's own community and the community it bridges to, and only count the bridge if it have significantly more connections than the community average and thus implement a more statistical method for examining the bridges. We found that some of the bridging items did not make that much sense an example being 'remoulade øko', and an expansion of this study might reveal something more insightful. Additionally, the assortativity score of the price attribute might be somewhat misleading because the prices in the data are measured in kilo-prices, rather than the actual prices paid when shopping. Although kilo-prices are comparable, they do not always reflect the true cost to the consumer when cheking out. In the scraping process, including the exact item prices could have provided different insights, potentially offering a more accurate representation and then the two price-scores could have been compared in the analysis. In general, including more attributes in the scraping process, such as customer ratings, would have provided additional interesting data for analysis.

Another aspect we would like to investigate further is the high similarity among many products in the data. For instance, we have multiple nodes that are all 'orange-juices' and the name is just varying a bit based on the brand that produces the item. This applies to almost all the products. An addition to the project would be to include the different brands for all products as an attribute in the network and then investigate buying patterns among product brands as well.

We did consider doing sentiment analysis on the product descriptions but decided against it with the argument that we did not anticipate that a lot of people would read these online descriptions when going gorcery-shopping. Therefore we wouldn't initially think that the sentiment in the descriptions would have an effect on the bying patterns of simple groceries, but this would also have been an interesting analysis given more time.

From an ethical perspective, we noted that some findings within this field about consumer behaviour could be misused by retailers to enhance dark patterns in store designs. For example, placing the candy section where people queue, thus tempting them while they wait, is a common practice. Some results could be exploited to reinforce such manipulative tactics or to develop new ones, ultimately manipulating consumer behavior in ways that may not be in their best interest. In his type of research it is important to advocate for the responsible use of data. On the other hand, these inventory management tactics also likely significantly shaped the buying patterns we observed in this study, meaning that while consumers influences the inventory placements, the observed patterns we have seen also reflect the impact of strategic product placements where products in the same categories are being bought together.

5. References ¶

[1] Nordic Social. (n.d.). Købsadfærd. Nordic Social. Retrieved May 9, 2024, from https://nordicsocial.dk/kobsadfaerd/

[2] Boston Consulting Group. (2023). Danish consumer sentiment series: June 2023. https://web-assets.bcg.com/31/bc/0ebe31544e6bb4cdc22fba83a735/danish-consumer-sentiment-series-june-2023.pdf

[3] Network Science, by Albert-Laszlo Barabasi. http://networksciencebook.com/chapter/3

[4] Visible Network Labs. (2021, April 16). Understanding network centrality. https://visiblenetworklabs.com/2021/04/16/understanding-network-centrality/

[5] Ducrot, P., Méjean, C., Aroumougame, V. et al. Meal planning is associated with food variety, diet quality and body weight status in a large sample of French adults. Int J Behav Nutr Phys Act 14, 12 (2017). https://doi.org/10.1186/s12966-017-0461-7